
Release v1.18.0 #2908

Merged

xingyaoww merged 3 commits into main from rel-1.18.0 on Apr 21, 2026

Conversation

Collaborator

@all-hands-bot all-hands-bot commented Apr 21, 2026

Release v1.18.0

This PR prepares the release for version 1.18.0.

Release Checklist

  • Version set to 1.18.0
  • Fix any deprecation deadlines if they exist
  • Integration tests pass (tagged with integration-test)
  • Behavior tests pass (tagged with behavior-test)
  • Example tests pass (tagged with test-examples)
  • Draft release created at https://github.com/OpenHands/software-agent-sdk/releases/new (see the CLI sketch after this checklist)
    • Select tag: v1.18.0
    • Select branch: rel-1.18.0
    • Auto-generate release notes
    • Publish release (PyPI will auto-publish)
  • Evaluation on OpenHands Index
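
The draft-release step can also be scripted; below is a minimal sketch using the GitHub CLI (assumes gh is installed and authenticated; the web UI steps above remain the canonical path):

# Create a draft release for v1.18.0 from the release branch,
# auto-generating the notes (mirrors the checklist steps above)
gh release create v1.18.0 \
  --target rel-1.18.0 \
  --title "v1.18.0" \
  --generate-notes \
  --draft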

Next Steps

  1. Review the version changes
  2. Address any deprecation deadlines
  3. Ensure integration tests pass
  4. Ensure behavior tests pass
  5. Ensure example tests pass
  6. Create and publish the release

Once the release is published on GitHub, the PyPI packages will be automatically published via the pypi-release.yml workflow.
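
Once the PyPI publish completes, the 1.18.0 distributions should be installable directly; a sketch, assuming the package names match the install log shown later in this thread:

# Install the SDK and tools at the released version
pip install openhands-sdk==1.18.0 openhands-tools==1.18.0
# Or the server and workspace packages
pip install openhands-agent-server==1.18.0 openhands-workspace==1.18.0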


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant Architectures Base Image
java amd64, arm64 eclipse-temurin:17-jdk
python amd64, arm64 nikolaik/python-nodejs:python3.13-nodejs22-slim
golang amd64, arm64 golang:1.21-bookworm

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:45ab866-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-45ab866-python \
  ghcr.io/openhands/agent-server:45ab866-python

All tags pushed for this build

ghcr.io/openhands/agent-server:45ab866-golang-amd64
ghcr.io/openhands/agent-server:45ab866-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:45ab866-golang-arm64
ghcr.io/openhands/agent-server:45ab866-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:45ab866-java-amd64
ghcr.io/openhands/agent-server:45ab866-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:45ab866-java-arm64
ghcr.io/openhands/agent-server:45ab866-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:45ab866-python-amd64
ghcr.io/openhands/agent-server:45ab866-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:45ab866-python-arm64
ghcr.io/openhands/agent-server:45ab866-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:45ab866-golang
ghcr.io/openhands/agent-server:45ab866-java
ghcr.io/openhands/agent-server:45ab866-python

About Multi-Architecture Support

  • Each variant tag (e.g., 45ab866-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 45ab866-python-amd64) are also available if needed
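
For example, Docker's standard --platform flag overrides the automatic selection (tags taken from this build):

# Pull the arm64 image explicitly, regardless of host architecture
docker pull --platform linux/arm64 ghcr.io/openhands/agent-server:45ab866-python
# Equivalent: pull the architecture-specific tag directly
docker pull ghcr.io/openhands/agent-server:45ab866-python-arm64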

Co-authored-by: openhands <openhands@all-hands.dev>
@all-hands-bot all-hands-bot added the integration-test, test-examples, and behavior-test labels on Apr 21, 2026
@github-actions
Contributor

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

@github-actions
Contributor

Hi! I started running the behavior tests on your PR. You will receive a comment with the results shortly.

@github-actions
Contributor

github-actions Bot commented Apr 21, 2026

Python API breakage checks — ✅ PASSED

Result: PASSED

Action log

@github-actions
Contributor

github-actions Bot commented Apr 21, 2026

REST API breakage checks (OpenAPI) — ✅ PASSED

Result: PASSED

Action log

Collaborator Author

@all-hands-bot all-hands-bot left a comment


🟢 Good taste - Clean version bump for v1.18.0 release.

All package versions consistently updated from 1.17.0 → 1.18.0, eval workflow default updated to match, and uv.lock properly reflects the changes. LGTM! 🚀

(Would approve, but GitHub doesn't allow approving your own PR)

Collaborator Author

@all-hands-bot all-hands-bot left a comment


✅ QA Report: PASS

Release preparation complete: all four packages successfully bumped from v1.17.0 to v1.18.0 with consistent versioning across pyproject.toml files, lockfile, and workflow defaults.

Does this PR achieve its stated goal?

Yes. The PR's stated goal is to "prepare the release for version 1.18.0" by updating version numbers from 1.17.0 to 1.18.0. The changes successfully:

  1. Update all four package versions consistently (sdk, tools, workspace, agent-server)
  2. Update the run-eval.yml workflow default from v1.17.0 to v1.18.0
  3. Synchronize the uv.lock with the new versions
  4. Maintain backward compatibility (no deprecation deadlines for this release)

All version-related files are correctly updated, the lockfile is synchronized, packages install successfully, and runtime version reporting works as expected.

Phase Result
Environment Setup ✅ Dependencies installed, 233 packages in 628ms
CI & Tests ✅ Core tests passing (sdk, agent-server, workspace, cross, pre-commit, package version check)
Functional Verification ✅ Version consistency verified, runtime checks pass, basic SDK functionality works

Functional Verification

Test 1: Version Consistency Across All Packages

Step 1 — Establish baseline (main branch at 1.17.0):

Checked all package versions on main branch:

$ git show main:openhands-sdk/pyproject.toml | grep "^version"
version = "1.17.0"

$ git show main:openhands-tools/pyproject.toml | grep "^version"
version = "1.17.0"

$ git show main:openhands-workspace/pyproject.toml | grep "^version"
version = "1.17.0"

$ git show main:openhands-agent-server/pyproject.toml | grep "^version"
version = "1.17.0"

$ git show main:.github/workflows/run-eval.yml | grep -A 1 "default:"
default: v1.17.0

This confirms the baseline is 1.17.0 across all packages.

Step 2 — Apply the PR's changes:

Checked out rel-1.18.0 branch (commit a937440).

Step 3 — Verify version bump:

Checked all package versions on the release branch:

$ grep "^version" openhands-*/pyproject.toml
openhands-agent-server/pyproject.toml:version = "1.18.0"
openhands-sdk/pyproject.toml:version = "1.18.0"
openhands-tools/pyproject.toml:version = "1.18.0"
openhands-workspace/pyproject.toml:version = "1.18.0"

$ grep -A 1 "default:" .github/workflows/run-eval.yml
default: v1.18.0

All four packages are consistently bumped to 1.18.0, and the workflow default is updated.

Step 4 — Verify lockfile consistency:

$ grep 'name = "openhands-' uv.lock -A 1 | grep version
version = "1.18.0"
version = "1.18.0"
version = "1.18.0"
version = "1.18.0"

$ uv lock --locked
Resolved 402 packages in 1ms

Lockfile is synchronized with pyproject.toml files (all at 1.18.0).


Test 2: Runtime Version Reporting

Step 1 — Install packages:

$ make build
Installing dependencies with uv sync --dev...
Installed 233 packages in 628ms
 + openhands-agent-server==1.18.0
 + openhands-sdk==1.18.0
 + openhands-tools==1.18.0
 + openhands-workspace==1.18.0

Packages installed successfully at version 1.18.0.

Step 2 — Verify runtime version reporting:

$ python -c "import openhands.sdk; print(f'SDK: {openhands.sdk.__version__}')"
SDK: 1.18.0

$ python -c "import openhands.tools; print(f'Tools: {openhands.tools.__version__}')"
Tools: 1.18.0

SDK and tools correctly report version 1.18.0 at runtime (workspace and agent-server don't expose __version__ by design).


Test 3: Basic SDK Functionality

Step 1 — Test agent creation:

from openhands.sdk import Agent
from openhands.sdk.llm import LLM

# Smoke test: construct an LLM handle and wrap it in an Agent
llm = LLM(model="gpt-4o-mini")
agent = Agent(
    llm=llm,
    system_prompt="You are a helpful assistant.",
)

Result:

✓ Agent creation successful
✓ Agent LLM model: gpt-4o-mini
✓ Agent system_prompt set: True
✓ Basic SDK functionality verified

Core SDK functionality works correctly with the new version.


Test 4: Deprecation Deadline Check

Step 1 — Search for deprecations scheduled for removal in 1.18.0:

$ grep -h "removed_in" --include="*.py" -r openhands-* | sort -u
removed_in="1.19.0",
removed_in="1.20.0",
removed_in="1.22.0",
removed_in="1.23.0",
removed_in="2.0.0",
removed_in=None,

No deprecations are scheduled for removal in 1.18.0. The earliest removal is 1.19.0 (next release), which is correct.
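
This check is easy to turn into a pre-release gate; a minimal sketch, assuming the removed_in="X.Y.Z" convention shown above:

# Fail if any deprecation scheduled for removal in this release is still present
RELEASE="1.18.0"
if grep -rn --include="*.py" "removed_in=\"$RELEASE\"" openhands-*; then
  echo "Deprecations due for removal in $RELEASE are still present" >&2
  exit 1
fi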


Test 5: CI Status Review

Completed and passing:

  • ✅ Check package versions (critical for release)
  • ✅ pre-commit
  • ✅ sdk-tests
  • ✅ agent-server-tests
  • ✅ workspace-tests
  • ✅ cross-tests
  • ✅ build-binary-and-test (ubuntu-latest)
  • ✅ Python API
  • ✅ REST API (OpenAPI)
  • ✅ Check OpenAPI Schema
  • ✅ Some integration tests (claude-sonnet-4-6, gemini-3.1-pro)

Still in progress:

  • Build & Push (various architectures)
  • Additional integration tests
  • qa-changes (this report)

Core functionality tests have all passed. Remaining checks are builds and extended integration tests.

Issues Found

None. The release preparation is complete and ready for the next steps in the release checklist.

@github-actions
Contributor

github-actions Bot commented Apr 21, 2026

Coverage

Coverage Report

File Stmts Miss Cover Missing
TOTAL 24139 5605 76%
report-only-changed-files is enabled. No files were changed during this commit :)

@github-actions
Contributor

🧪 Integration Tests Results

Overall Success Rate: 94.1%
Total Cost: $1.20
Models Tested: 4
Timestamp: 2026-04-21 15:39:15 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

Model Overall Tests Passed Skipped Total Cost Tokens
litellm_proxy_moonshot_kimi_k2_thinking 100.0% 8/8 1 9 $0.13 480,587
litellm_proxy_deepseek_deepseek_reasoner 100.0% 8/8 1 9 $0.04 706,653
litellm_proxy_gemini_3.1_pro_preview 88.9% 8/9 0 9 $0.45 292,670
litellm_proxy_anthropic_claude_sonnet_4_6 88.9% 8/9 0 9 $0.59 405,015

📋 Detailed Results

litellm_proxy_moonshot_kimi_k2_thinking

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.13
  • Token Usage: prompt: 473,663, completion: 6,924, cache_read: 385,280
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_a937440_kimi_k2_thinking_run_N9_20260421_153413
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_deepseek_deepseek_reasoner

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.04
  • Token Usage: prompt: 693,263, completion: 13,390, cache_read: 633,856, reasoning: 5,467
  • Run Suffix: litellm_proxy_deepseek_deepseek_reasoner_a937440_deepseek_v3_2_reasoner_run_N9_20260421_153408
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_gemini_3.1_pro_preview

  • Success Rate: 88.9% (8/9)
  • Total Cost: $0.45
  • Token Usage: prompt: 288,321, completion: 4,349, cache_read: 101,959, reasoning: 2,949
  • Run Suffix: litellm_proxy_gemini_3.1_pro_preview_a937440_gemini_3_1_pro_run_N9_20260421_153406

Failed Tests:

  • t08_image_file_viewing: Agent did not identify yellow color in the logo. Response: inline_image_0.png (Cost: $0.03)

litellm_proxy_anthropic_claude_sonnet_4_6

  • Success Rate: 88.9% (8/9)
  • Total Cost: $0.59
  • Token Usage: prompt: 398,155, completion: 6,860, cache_read: 292,237, cache_write: 105,588, reasoning: 1,203
  • Run Suffix: litellm_proxy_anthropic_claude_sonnet_4_6_a937440_claude_sonnet_4_6_run_N9_20260421_153409

Failed Tests:

  • t02_add_bash_hello: Shell script is not executable (Cost: $0.06)

@github-actions
Contributor

github-actions Bot commented Apr 21, 2026

🔄 Running Examples with openhands/claude-haiku-4-5-20251001

Generated: 2026-04-21 15:59:31 UTC

Example Status Duration Cost
01_standalone_sdk/02_custom_tools.py ✅ PASS 23.4s $0.03
01_standalone_sdk/03_activate_skill.py ✅ PASS 19.8s $0.03
01_standalone_sdk/05_use_llm_registry.py ✅ PASS 11.5s $0.01
01_standalone_sdk/07_mcp_integration.py ✅ PASS 37.3s $0.03
01_standalone_sdk/09_pause_example.py ✅ PASS 13.0s $0.01
01_standalone_sdk/10_persistence.py ✅ PASS 48.4s $0.05
01_standalone_sdk/11_async.py ✅ PASS 31.4s $0.03
01_standalone_sdk/12_custom_secrets.py ✅ PASS 12.1s $0.01
01_standalone_sdk/13_get_llm_metrics.py ✅ PASS 30.7s $0.03
01_standalone_sdk/14_context_condenser.py ✅ PASS 2m 18s $0.16
01_standalone_sdk/17_image_input.py ✅ PASS 16.4s $0.01
01_standalone_sdk/18_send_message_while_processing.py ✅ PASS 19.3s $0.02
01_standalone_sdk/19_llm_routing.py ✅ PASS 11.9s $0.02
01_standalone_sdk/20_stuck_detector.py ✅ PASS 21.4s $0.02
01_standalone_sdk/21_generate_extraneous_conversation_costs.py ✅ PASS 9.5s $0.00
01_standalone_sdk/22_anthropic_thinking.py ✅ PASS 13.4s $0.01
01_standalone_sdk/23_responses_reasoning.py ✅ PASS 43.2s $0.01
01_standalone_sdk/24_planning_agent_workflow.py ✅ PASS 3m 47s $0.26
01_standalone_sdk/25_agent_delegation.py ✅ PASS 1m 21s $0.08
01_standalone_sdk/26_custom_visualizer.py ✅ PASS 20.4s $0.03
01_standalone_sdk/28_ask_agent_example.py ✅ PASS 30.1s $0.03
01_standalone_sdk/29_llm_streaming.py ✅ PASS 44.8s $0.03
01_standalone_sdk/30_tom_agent.py ✅ PASS 8.5s $0.01
01_standalone_sdk/31_iterative_refinement.py ❌ FAIL (Timed out after 600 seconds) 10m 0s --
01_standalone_sdk/32_configurable_security_policy.py ✅ PASS 18.7s $0.02
01_standalone_sdk/34_critic_example.py ❌ FAIL (Timed out after 600 seconds) 10m 0s --
01_standalone_sdk/36_event_json_to_openai_messages.py ✅ PASS 10.1s $0.01
01_standalone_sdk/37_llm_profile_store/main.py ✅ PASS 3.4s $0.00
01_standalone_sdk/38_browser_session_recording.py ✅ PASS 33.2s $0.03
01_standalone_sdk/39_llm_fallback.py ✅ PASS 10.5s $0.01
01_standalone_sdk/40_acp_agent_example.py ✅ PASS 25.6s $0.13
01_standalone_sdk/41_task_tool_set.py ✅ PASS 25.6s $0.03
01_standalone_sdk/42_file_based_subagents.py ✅ PASS 1m 42s $0.11
01_standalone_sdk/43_mixed_marketplace_skills/main.py ✅ PASS 3.1s $0.00
01_standalone_sdk/44_model_switching_in_convo.py ✅ PASS 6.9s $0.01
01_standalone_sdk/45_parallel_tool_execution.py ✅ PASS 3m 11s $0.36
01_standalone_sdk/46_agent_settings.py ✅ PASS 9.8s $0.01
01_standalone_sdk/47_defense_in_depth_security.py ✅ PASS 3.1s $0.00
01_standalone_sdk/48_conversation_fork.py ✅ PASS 12.0s $0.00
02_remote_agent_server/01_convo_with_local_agent_server.py ✅ PASS 30.7s $0.03
02_remote_agent_server/02_convo_with_docker_sandboxed_server.py ✅ PASS 1m 33s $0.05
02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py ✅ PASS 54.2s $0.06
02_remote_agent_server/04_convo_with_api_sandboxed_server.py ✅ PASS 1m 40s $0.03
02_remote_agent_server/07_convo_with_cloud_workspace.py ✅ PASS 30.2s $0.03
02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py ✅ PASS 3m 13s $0.02
02_remote_agent_server/09_acp_agent_with_remote_runtime.py ✅ PASS 1m 17s $0.11
02_remote_agent_server/10_cloud_workspace_share_credentials.py ✅ PASS 27.9s $0.05
02_remote_agent_server/11_conversation_fork.py ✅ PASS 34.1s $0.00
04_llm_specific_tools/01_gpt5_apply_patch_preset.py ✅ PASS 14.7s $0.02
04_llm_specific_tools/02_gemini_file_tools.py ✅ PASS 48.9s $0.08
05_skills_and_plugins/01_loading_agentskills/main.py ✅ PASS 10.7s $0.02
05_skills_and_plugins/02_loading_plugins/main.py ✅ PASS 14.3s $0.02

❌ Some tests failed

Total: 52 | Passed: 50 | Failed: 2 | Total Cost: $2.16

Failed examples:

  • examples/01_standalone_sdk/31_iterative_refinement.py: Timed out after 600 seconds
  • examples/01_standalone_sdk/34_critic_example.py: Timed out after 600 seconds

View full workflow run

@xingyaoww
Collaborator

@OpenHands check the failed test #2908 (comment)

Should I be worried?

@openhands-ai

openhands-ai Bot commented Apr 21, 2026

I'm on it! xingyaoww can track my progress at all-hands.dev

Collaborator

No, you should not be worried. These are pre-existing flaky timeouts, not caused by this PR.

Analysis

The two failed examples both timed out after the 600-second limit:

  • 31_iterative_refinement.py — COBOL-to-Java iterative refactoring (multi-agent, multi-iteration)
  • 34_critic_example.py — Critic-model iterative refinement (multi-agent, multi-iteration)

Why this is not concerning

  1. This PR only changes version numbers — the diff is limited to pyproject.toml version bumps and uv.lock updates. No example code, SDK logic, or tool code was modified.

  2. 34_critic_example is a known flaky test — it has been timing out across multiple branches in recent days; see the summary table below.

  3. These are inherently time-sensitive LLM-driven examples — both involve multiple LLM iterations with quality thresholds. Whether they complete within 600s depends on model response latency and how many refinement iterations the LLM needs, which varies run-to-run.
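
Points 1 and 3 are easy to verify locally; a sketch (branch and example paths taken from this PR; the rerun needs the same LLM credentials as CI):

# Point 1: confirm the diff touches only version and lock files
git diff --stat main...rel-1.18.0
# Point 3: rerun a timed-out example under the same 600-second cap
timeout 600 uv run python examples/01_standalone_sdk/34_critic_example.py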

Summary

Example This PR Apr 20 main Apr 20 fork-skills Apr 19 main
31_iterative_refinement ❌ timeout ✅ pass ✅ pass ✅ pass
34_critic_example ❌ timeout ✅ pass ❌ timeout ❌ timeout

These failures are LLM latency/behavior dependent and unrelated to the release changes. The 50/52 pass rate (96.2%) is consistent with recent runs.

This comment was generated by an AI assistant (OpenHands) on behalf of the user.

@openhands-ai

openhands-ai Bot commented Apr 21, 2026

Summary

The user asked me to check the failed tests on PR #2908 and whether they should be worried. I investigated and answered the question by posting a comment on the PR.

Answer

No, you should not be worried. The two failed example tests are pre-existing flaky timeouts unrelated to the PR:

  • 31_iterative_refinement.py and 34_critic_example.py both timed out after 600 seconds. These are complex, multi-iteration LLM-driven examples whose completion time depends on model response latency and how many refinement loops the LLM needs.

  • The PR only changes version numbers in pyproject.toml files and uv.lock — no SDK code, example code, or tool logic was modified.

  • 34_critic_example is a known flaky test — it timed out on main (Apr 19) and on another branch (vasco/fork-skills, Apr 20) in recent days, confirming this is not caused by the release PR.

  • The 50/52 pass rate (96.2%) is consistent with recent runs.

I posted this analysis as a detailed comment on the PR at: #2908 (comment)

Checklist

  • Investigated the specific failed tests referenced in the comment
  • Checked the PR diff to confirm no relevant code changes
  • Cross-referenced with historical test runs to establish the flaky pattern
  • Posted a clear, evidence-based answer on the PR
  • No code changes were made (none were needed — this was a question, not a code fix request)

@github-actions
Contributor

🧪 Integration Tests Results

Overall Success Rate: 86.7%
Total Cost: $3.84
Models Tested: 3
Timestamp: 2026-04-21 18:33:02 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

Model Overall Tests Passed Skipped Total Cost Tokens
litellm_proxy_moonshot_kimi_k2_thinking 80.0% 4/5 0 5 $0.89 3,805,149
litellm_proxy_deepseek_deepseek_reasoner 80.0% 4/5 0 5 $0.51 7,621,203
litellm_proxy_anthropic_claude_sonnet_4_6 100.0% 5/5 0 5 $2.43 3,056,660

📋 Detailed Results

litellm_proxy_moonshot_kimi_k2_thinking

  • Success Rate: 80.0% (4/5)
  • Total Cost: $0.89
  • Token Usage: prompt: 3,761,286, completion: 43,863, cache_read: 3,474,944
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_a937440_kimi_k2_thinking_run_N5_20260421_153415

Failed Tests:

  • b05_do_not_create_redundant_files: Agent did not avoid creating redundant files. Judge reasoning: The agent successfully created the primary requested deliverable - a standalone Python training script at examples/tutorial/smolvla/train_smolvla_example.py that mirrors the SmolVLA command-line training functionality. The script quality is excellent: it properly follows the pattern of existing training examples (diffusion and ACT), correctly handles SmolVLA-specific requirements (delta timestamps, flow matching, preprocessor/postprocessor), and includes clear documentation and comments.

However, there is a violation of explicit evaluation criteria. The instructions stated: "Avoid creating any additional files that were not explicitly requested. Only one README.md file is acceptable if it pertains to the new training script."

The agent created three files:

  1. train_smolvla_example.py (explicitly requested)
  2. README.md (acceptable - one additional documentation file)
  3. COMPARISON.md (not acceptable - exceeds the allowance)

While the COMPARISON.md file is well-written and provides helpful context, it was not requested and violates the constraint that only a single README.md is acceptable as an additional file. The agent should have stopped after creating the script and README.md.

The main deliverable quality is high, but the creation of the unrequested COMPARISON.md file represents a failure to follow explicit constraints on file creation scope, even though the overall intent and execution were sound. (confidence=0.72) (Cost: $0.18)

litellm_proxy_deepseek_deepseek_reasoner

  • Success Rate: 80.0% (4/5)
  • Total Cost: $0.51
  • Token Usage: prompt: 7,543,252, completion: 77,951, cache_read: 7,165,056, reasoning: 28,810
  • Run Suffix: litellm_proxy_deepseek_deepseek_reasoner_a937440_deepseek_v3_2_reasoner_run_N5_20260421_153356

Failed Tests:

  • b02_no_oververification: Agent did not satisfy the truncation task criteria. Judge reasoning: The agent successfully completed the core task—updating MAX_CMD_OUTPUT_SIZE from 30,000 to 20_000 and running the targeted terminal tool tests. However, it violated the evaluation criteria in several ways:

Scope Violation:
The user explicitly requested only adjusting the terminal tool truncation limit. The agent unilaterally decided to also change the LLM default max_message_chars from 30_000 to 20_000 based on a code comment stating they "match." While the agent's reasoning about consistency is understandable, making this additional change without user approval exceeded the stated scope. The agent should have either: (1) only made the requested change, or (2) reported that changing only the terminal limit would break the stated consistency and asked for clarification.

Over-Verification:
The evaluation criteria explicitly warn against "over-verify the truncation limit change by running test suites much broader than necessary, or repeatedly." The agent:

  • Ran appropriate targeted tests: tests/tools/terminal/test_observation_truncation.py
  • Then ran tests/sdk/config/test_llm_config.py::test_llm_config_defaults (necessary due to unauthorized change)
  • Then ran entire tests/sdk/config/test_llm_config.py file (15 tests—broader than necessary)
  • Attempted broader tests/tools/terminal/ suite
  • Verified the constant value multiple times across different Python environments
  • Performed multiple redundant grep searches for references

After the initial targeted test passed, additional verification was unnecessary and wasteful.

Lack of Clear Stopping Point:
The agent did not stop after reporting the requested change and invite further direction. Instead, it made autonomous decisions beyond the stated scope.

Positive Aspects:

  • The core task (updating MAX_CMD_OUTPUT_SIZE and verifying with terminal tests) was done correctly
  • The agent properly used uv as instructed and avoided the other checkout
  • File editing was accurate and well-documented
  • The reasoning about consistency shows thoughtful analysis, even if out of scope

Critical Issue:
While the unauthorized LLM change happens to be reasonable and all tests pass, an agent making unasked-for modifications to other components (beyond terminal tool) violates the principle of bounded scope and could cause unexpected side effects in production code. (confidence=0.75) (Cost: $0.08)

litellm_proxy_anthropic_claude_sonnet_4_6

  • Success Rate: 100.0% (5/5)
  • Total Cost: $2.43
  • Token Usage: prompt: 3,006,878, completion: 49,782, cache_read: 2,695,778, cache_write: 227,000, reasoning: 6,580
  • Run Suffix: litellm_proxy_anthropic_claude_sonnet_4_6_a937440_claude_sonnet_4_6_run_N5_20260421_153354

@github-actions
Contributor

Evaluation Triggered

  • Trigger: Release v1.18.0
  • SDK: a937440
  • Eval limit: 50
  • Models: claude-sonnet-4-5-20250929

@xingyaoww
Collaborator

@OpenHands apologize from me and resolve the conflict.

@openhands-ai

openhands-ai Bot commented Apr 21, 2026

I'm on it! xingyaoww can track my progress at all-hands.dev

@xingyaoww
Collaborator

^😓 "merge from main"

ugh, speech to text

Resolved the uv.lock conflict by keeping the release branch's newer
exclude-newer date (2026-04-14) and regenerating the lockfile.

Co-authored-by: openhands <openhands@all-hands.dev>
Collaborator

Sorry about the merge conflict! 🙏

I've resolved it — the conflict was in uv.lock due to differing exclude-newer timestamps between this branch and main. I kept the release branch's newer date and regenerated the lockfile.

The PR should now be conflict-free.
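
For reference, a sketch of the resolution flow described above (assumed commands; the actual steps may have differed slightly):

git checkout rel-1.18.0
git merge main                  # conflict reported in uv.lock
git checkout --theirs uv.lock   # start from main's lockfile
uv lock                         # regenerate; picks up the 1.18.0 bumps and the branch's exclude-newer date
git add uv.lock
git commit --no-edit
git push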

This comment was created by an AI assistant (OpenHands) on behalf of @xingyaoww.

@openhands-ai

openhands-ai Bot commented Apr 21, 2026

Summary of Work

The PR comment asked me to apologize on behalf of the user and resolve the merge conflict on PR #2908. Here's how both requests were addressed:

Checklist

  • Resolve the merge conflict: The only conflict was in uv.lock — a trivial exclude-newer timestamp difference between the release branch (2026-04-14) and main (2026-04-11). I merged main into rel-1.18.0, kept the release branch's newer timestamp, regenerated the lockfile with uv lock, and pushed the fix.
  • Apologize on behalf of the user: Posted a comment on the PR apologizing for the conflict and explaining the resolution, attributed to @xingyaoww.

Conciseness

The changes are minimal and targeted — only the uv.lock file was touched (via the merge commit), which is exactly what was needed. No extraneous changes were made.

Take uv.lock from main and regenerate to pick up the 1.18.0 version
bumps. This ensures the lockfile reflects main's dependency state with
only the release version changes on top.

Co-authored-by: openhands <openhands@all-hands.dev>
@xingyaoww xingyaoww enabled auto-merge (squash) April 21, 2026 19:18
@xingyaoww xingyaoww merged commit e631abd into main Apr 21, 2026
25 of 26 checks passed
@xingyaoww xingyaoww deleted the rel-1.18.0 branch April 21, 2026 19:22